Abstract

We study the mixtures of factorizing probability distributions represented as visi- 
ble marginal distributions in stochastic layered networks. We take the perspective 
of kernel transitions of distributions, which gives a unified picture of distributed 
representations arising from Deep Belief Networks (DBN) and other networks 
without lateral connections. We describe combinatorial and geometric properties 
of the set of kernels and products of kernels realizable by DBNs as the network 
parameters vary. We describe explicit classes of probability distributions, includ- 
ing exponential families, that can be learned by DBNs. We use these submodels 
to bound the maximal and the expected Kullback-Leibler approximation errors of 
DBNs from above depending on the number of hidden layers and units that they 
contain. 



1 Introduction 

Deep belief networks (DBNs) are a kind of learning machine introduced originally in [10]. They are
used to extract features from data, often by an unsupervised pretraining step, so their properties as 
generative models and their expressive power are also of interest, see El E3 [TT1 |T5]|. A DBN can 
be seen as a concatenation of modules that implement kernel transitions (stochastic linear maps) of 
probability vectors. We describe this perspective in Section 2, and the geometry and combinatorics
of the set of kernels that DBNs can represent in Section 3. See Figure 1.

The deep belief network probability model $\mathrm{DBN}(n_0, n_1, \ldots, n_l)$ with layers of widths $n_0, \ldots, n_l$ is the set of marginals

$$P(h^0) = \sum_{h^1 \in \{0,1\}^{n_1}} \cdots \sum_{h^l \in \{0,1\}^{n_l}} P(h^0, h^1, \ldots, h^l) \quad \text{for all } h^0 \in \{0,1\}^{n_0},$$

of all joint probability distributions on the states of a layered network. The top layer has bipartite undirected connections, with subsequent layers bipartite and downward-directed, giving joint unmarginalized probabilities:

$$P(h^0, h^1, \ldots, h^l) = \Big( \prod_{k=1}^{l-1} P(h^{k-1} \mid h^k) \Big) P(h^{l-1}, h^l), \tag{1}$$

for all $(h^0, \ldots, h^l) \in \{0,1\}^{n_0} \times \cdots \times \{0,1\}^{n_l}$, where

$$P(h^{l-1}, h^l) = \frac{1}{Z} \exp\big(h^l B^l + h^l W^l h^{l-1} + B^{l-1} h^{l-1}\big), \quad \text{and} \tag{2}$$

$$P(h^{k-1} \mid h^k) = \frac{1}{Z_{h^k}} \exp\big(h^k W^k h^{k-1} + B^{k-1} h^{k-1}\big). \tag{3}$$

Here $h^k = (h^k_1, \ldots, h^k_{n_k}) \in \{0,1\}^{n_k}$ denotes the states of the units in the $k$th layer; $W^k \in \mathbb{R}^{n_k \times n_{k-1}}$ is a matrix of connection weights between units from the $k$th and $(k-1)$th layer; $B^k \in \mathbb{R}^{n_k}$ is a vector of bias weights of the units in the $k$th layer; $Z = \sum_{h^{l-1}, h^l} \exp(h^l W^l h^{l-1} + B^{l-1} h^{l-1} + B^l h^l)$ is a normalization constant that depends on $W^l$, $B^{l-1}$, and $B^l$; and $Z_{h^{k+1}} = \sum_{h^k} \exp(h^{k+1} W^{k+1} h^k + B^k h^k)$ is a normalization constant that depends on $W^{k+1}$, $B^k$, and $h^{k+1}$.

The total number of parameters of this model is $d = \big(\sum_{k=1}^{l} n_{k-1} n_k\big) + \sum_{k=0}^{l} n_k$, treating the layer widths $n_0, \ldots, n_l$ as hyperparameters.

Figure 1: Left: A network module that realizes stochastic transitions $\mathcal{K}_{m,n}$ from the set of distributions $\mathcal{M} \subseteq \Delta_{2^m-1}$ on the top layer, to probability distributions $\mathcal{M} \cdot \mathcal{K}_{m,n} \subseteq \Delta_{2^n-1}$ on the bottom layer, see eq. (8). Right: The kernels $K_{W,B}$ described in Proposition 4.

A restricted Boltzmann machine (RBM) [22, 9] is formally the same as a DBN with only one hidden layer. The model $\mathrm{RBM}_{n,m} = \mathrm{DBN}(n, m)$ is the set of probability distributions on $\{0,1\}^n$ of the form $P(v) = \frac{1}{Z} \sum_{h \in \{0,1\}^m} \exp(hWv + Ch + Bv)$ for all $v \in \{0,1\}^n$.

We denote by $\Delta_{2^n-1}$ the simplex of probability distributions on $\{0,1\}^n$. Its vertices are the point measures $\delta_x$, $x \in \{0,1\}^n$.

Sutskever and Hinton [23] showed that a very deep and narrow DBN, with $\sim 3 \cdot 2^n$ hidden layers of width $(n+1)$, can approximate any distribution on $\{0,1\}^n$ arbitrarily well. Le Roux and Bengio [11] improved this bound, showing that fewer layers of width $n$ suffice. Montufar and Ay [15] improved that bound again. We are interested in the expressive power of DBNs which have fewer than $2^n - 1$ parameters and cannot approximate every probability distribution arbitrarily well. In [18] the maximal Kullback-Leibler approximation errors of RBMs were bounded from above by studying submodels of RBMs.



Definition 1. A submodel of a DBN with layer widths $n_0, \ldots, n_l$ is a set of probability distributions in $\Delta_{2^{n_0}-1}$ contained in $\mathrm{DBN}(n_0, \ldots, n_l)$.



Approaches to find explicit submodels of DBNs include studying 

• The set $\mathrm{DBN}(n_0, \ldots, n_l)$ as a mixture of conditional distributions with mixing distributions from the embedded model $\mathrm{DBN}(n_1, \ldots, n_l)$. This approach was proposed in [13] and used in [16] to study the expressive power of RBMs. In Section 2 we describe distributed mixtures of product distributions arising in layered networks.

• Models arising from probability sharing on RBMs. This idea has been used in ||23l [TT] Q2) 
to study universal approximation of probability distributions by DBNs. To study submodels of 
DBNs, one imposes constraints on the number and type of sharing steps (the number and widths 
of the hidden layers). The submodels are sub-simplicial-complexes of $\Delta_{2^n-1}$. In Section 3.2 we discuss certain faces of the probability simplex that can be represented by deep and narrow DBNs.

• The set of joint probability distributions on the states of all units of a DBN and their linear 
projections (by marginalization maps). 

• Graphical submodels of the DBN such as RBMs and trees. 

Understanding these items is helpful to lower bound the capabilities of deep belief networks. 

The marginal probability distributions on the states of the visible units of a stochastic network with 
no direct connections between visible units, are mixtures of product distributions. We call a mixture 
distributed when the mixture components share parameters in some way. Distributed representations 
Figure 2: Left: Two-bit product distributions with straight lines $\alpha B$, $\alpha \in \mathbb{R}$, as natural parameters for various choices of $B \in \mathbb{R}^2$. Middle: Linear projection of the left figure into the convex support of the two-bit independence model. Right: Linear projection of 10 $(2,2)$-zonoset tuples of product distributions with random $W \in \mathbb{R}^{2 \times 2}$, $B \in \mathbb{R}^2$.



have been discussed in I81H1H5I. Each layer of a DBN defines a distributed mixture of product 
distributions. Similarly, each layer of a deep Boltzmann machine (DBM) and a directed RBM define 
a distributed mixture of product distributions. A DBM is a layered network with undirected bipartite connections between units in subsequent layers, see [20]. The DBM model is the set of marginal distributions on the states of the variables in the bottom layer. The directed RBM model is the set of visible distributions of a pair of layers of binary units with directed connections from the top layer to the bottom layer, including top and bottom bias weights, and without connections within each layer, as shown in Figure 1.

In Section 2 we discuss the mixtures of product distributions represented by layered networks. In Section 3 we study the geometry of the set of all stochastic transitions that can be realized by DBN layers. In Section 4 we derive upper bounds on the maximal and mean approximation errors of DBNs. Section 5 presents a discussion of our results. All formal proofs of mathematical statements are deferred to the Appendix.

2 Distributed mixtures of products and stochastic kernels 

An exponential family is a set of probability distributions of the form $\mathcal{E}_V = \{p \propto \exp(f) : f \in V\}$, where $V$ is an affine space of functions on the set of elementary events. The set of all strictly positive product distributions of $n$ binary variables is an $n$-dimensional exponential family, denoted by $\mathcal{M}_n \subseteq \Delta_{2^n-1}$, with elements $p_B(v_1, \ldots, v_n) = \prod_{i=1}^{n} p_{B_i}(v_i) = \exp(Bv)/Z_B$, $Z_B = \sum_{v \in \{0,1\}^n} \exp(Bv)$. Here $B \in \mathbb{R}^n$ is called the natural parameter vector of $p_B$. The convex support of this model is an $n$-dimensional hypercube with points in one-to-one correspondence with the points in the closure $\overline{\mathcal{M}_n}$ of $\mathcal{M}_n$. See [5].
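The factorization $p_B(v) = \prod_i p_{B_i}(v_i) = \exp(Bv)/Z_B$ can be checked numerically. The following is a minimal sketch in plain Python; the helper names `product_distribution` and `bit_marginal` and the example vector `B` are illustrative assumptions, not notation from the text.

```python
import itertools
import math

def product_distribution(B):
    """Return {v: p_B(v)} with p_B(v) = exp(B.v) / Z_B over v in {0,1}^n."""
    n = len(B)
    states = list(itertools.product((0, 1), repeat=n))
    weights = [math.exp(sum(b * x for b, x in zip(B, v))) for v in states]
    Z_B = sum(weights)  # normalization constant
    return {v: w / Z_B for v, w in zip(states, weights)}

def bit_marginal(b_i, v_i):
    """Single-bit factor p_{B_i}(v_i) = exp(b_i * v_i) / (1 + exp(b_i))."""
    return math.exp(b_i * v_i) / (1.0 + math.exp(b_i))

# Hypothetical natural parameter vector for n = 3 binary variables.
B = [0.5, -1.0, 2.0]
p = product_distribution(B)

# Largest deviation between exp(Bv)/Z_B and the product of bit marginals.
max_gap = 0.0
for v, prob in p.items():
    factorized = 1.0
    for b_i, v_i in zip(B, v):
        factorized *= bit_marginal(b_i, v_i)
    max_gap = max(max_gap, abs(prob - factorized))
```

Here `max_gap` measures how far the two expressions for $p_B$ differ on the chosen example; it should vanish up to floating-point rounding.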

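The $k$-mixture of product distributions of $n$ binary variables is

$$\mathcal{M}_{n,k} := \Big\{ \sum_{j=1}^{k} \lambda_j p^{(j)} : p^{(j)} \in \mathcal{M}_n,\ \lambda_j \ge 0\ \forall j, \text{ and } \sum_{j=1}^{k} \lambda_j = 1 \Big\}. \tag{4}$$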

This set has the dimension expected from counting parameters, dim(7W n ,m) = min{2 n — 1, mn + 
m — 1}, unless n = 4 and m = 3, see HI . 

The marginal visible probability distributions of DBNs, DBMs, directed RBMs, and RBMs with $n$ binary visible units and $m$ binary units in the first hidden layer, all have the following form:

$$p(v) = \sum_{h \in \{0,1\}^m} p_{hW+B}(v)\, q(h) \quad \forall v \in \{0,1\}^n, \text{ where} \tag{5}$$

$$p_{hW+B}(v) = \frac{1}{Z_h} \exp((hW + B)v) \quad \forall v \in \{0,1\}^n,\ \forall h \in \{0,1\}^m, \tag{6}$$

with $Z_h = \sum_{v \in \{0,1\}^n} \exp((hW + B)v)$, $W \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^n$, and $q$ is a probability distribution on $h \in \{0,1\}^m$.
Figure 3: The set $u \cdot \mathcal{K}_{1,2} = \{u \cdot K_{W,B} : W \in \mathbb{R}^{1 \times 2}, B \in \mathbb{R}^{1 \times 2}\} \subseteq \Delta_3$, where $u = (1/2, 1/2)$.

The natural parameters $Z = \{hW + B : h \in \{0,1\}^m\}$, with $W \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^n$, of the $2^m$ product distributions $\{p_{hW+B} : h \in \{0,1\}^m\}$, are a multiset (a set with repetitions allowed) of points in $\mathbb{R}^n$ called an $(m,n)$-zonoset. In the literature of polytopes the convex hull of a zonoset is known as a zonotope.

Definition 2. We call $\{p_{hW+B} : h \in \{0,1\}^m\}$ the zonoset tuple of product distributions associated to the zonoset $Z = \{hW + B : h \in \{0,1\}^m\}$.

The number of parameters of a zonoset tuple is $(m+1)n$, while $2^m n$ parameters are needed for describing an arbitrary tuple of $2^m$ product distributions. Any $(m,n)$-zonoset tuple of product distributions is contained in an exponential subfamily of $\mathcal{M}_n$ of dimension $\min\{m, n\}$. Figure 2 illustrates zonoset tuples of product distributions on $\{0,1\}^2$.
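To make the zonoset concrete, the sketch below enumerates the points $hW + B$ for all $h \in \{0,1\}^m$ and compares the two parameter counts mentioned above. The function name `zonoset` and the example matrices are assumptions chosen for illustration only.

```python
import itertools

def zonoset(W, B):
    """Map each hidden state h in {0,1}^m to the point hW + B in R^n.

    A dict keyed by h keeps repeated points, so it represents a multiset.
    """
    m, n = len(W), len(B)
    points = {}
    for h in itertools.product((0, 1), repeat=m):
        points[h] = tuple(
            B[j] + sum(h[i] * W[i][j] for i in range(m)) for j in range(n)
        )
    return points

# Hypothetical parameters with m = 2 hidden and n = 3 visible units.
W = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]
B = [0.0, -1.0, 0.25]
Z = zonoset(W, B)

m, n = len(W), len(B)
shared_parameters = (m + 1) * n      # parameters of a zonoset tuple
free_parameters = (2 ** m) * n       # parameters of an arbitrary tuple
```

For this example the zonoset has four points, described by 9 shared parameters instead of the 12 that four unrelated product distributions on three bits would need.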

We can view eq. (5) as a transition of the marginal distribution $q$ on the states of the first hidden layer, to the visible distribution $p$, by a stochastic kernel:

$$p = q \cdot K_{W,B}, \tag{7}$$

where the kernel, called an $(m,n)$-zonoset kernel, is defined by the $2^m \times 2^n$-matrix with entries

$$K_{W,B}(h, v) := p_{hW+B}(v) \quad \text{for all } h \in \{0,1\}^m \text{ and all } v \in \{0,1\}^n. \tag{8}$$

Thus a zonoset tuple is the rows of a zonoset kernel viewed as a set. Each $K_{W,B}$ is a (row) stochastic matrix describing a linear map

$$K_{W,B} : \Delta_{2^m-1} \to \operatorname{conv}\{K_{W,B}(h, \cdot)\}_h \subseteq \Delta_{2^n-1}; \quad p \mapsto p \cdot K_{W,B}.$$

We denote the set of all $(m,n)$-zonoset kernels by

$$\mathcal{K}_{m,n} := \{K_{W,B} : W \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^n\}.$$

We write $\overline{\mathcal{K}}_{m,n}$ for the set of all kernels that can be expressed as the limit of a sequence $K_{W_i,B_i} \in \mathcal{K}_{m,n}$, $i \in \mathbb{N}$.
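The kernel of eq. (8) and the transition of eq. (7) can be sketched in a few lines of plain Python. The names `zonoset_kernel` and `transition`, the lexicographic ordering of states, and the example parameters are assumptions made for this illustration.

```python
import itertools
import math

def states(k):
    """All binary vectors of length k, in lexicographic order."""
    return list(itertools.product((0, 1), repeat=k))

def zonoset_kernel(W, B):
    """Return the 2^m x 2^n row-stochastic matrix with rows p_{hW+B}."""
    m, n = len(W), len(B)
    K = []
    for h in states(m):
        theta = [B[j] + sum(h[i] * W[i][j] for i in range(m)) for j in range(n)]
        weights = [math.exp(sum(t * x for t, x in zip(theta, v))) for v in states(n)]
        Z_h = sum(weights)  # row normalization constant
        K.append([w / Z_h for w in weights])
    return K

def transition(q, K):
    """Visible distribution p = q . K for an input distribution q on {0,1}^m."""
    return [sum(q[a] * K[a][b] for a in range(len(K))) for b in range(len(K[0]))]

# Hypothetical example: m = 1 hidden unit, n = 3 visible units.
W = [[1.0, -2.0, 0.5]]
B = [0.0, 0.3, -0.7]
K = zonoset_kernel(W, B)

p_point = transition([0.0, 1.0], K)    # point measure on h = 1
p_uniform = transition([0.5, 0.5], K)  # uniform input u
```

With a point measure as input the output is the corresponding row of the kernel, and with the uniform input it is the arithmetic mean of the rows.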

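The input distributions $q$ in eq. (5) are restricted in different ways for each model:

• For DBNs $q \in \mathrm{DBN}(n_1, \ldots, n_l)$, and $\mathrm{DBN}(n_0, \ldots, n_l) = \mathrm{DBN}(n_1, \ldots, n_l) \cdot \mathcal{K}_{n_1,n_0}$; in particular a DBN with layers of constant width is given by $\mathrm{RBM}_{n,n} \cdot \mathcal{K}_{n,n} \cdots \mathcal{K}_{n,n}$.

• For directed RBMs $q \in \mathcal{M}_m$, and the directed RBM model is $\mathcal{M}_m \cdot \mathcal{K}_{m,n}$.

• For RBMs $q \in \{\frac{1}{Z} \sum_v \exp((hW + B)v + Ch) : C \in \mathbb{R}^m\}$.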

• For DBMs qe E^,...,^ nL=\ exp((/i fc+1 V^ fc+1 + £ fc )/* fc ) exp(£^)}. 

In the case of RBMs and DBMs $q$ is subject to "feedback" from the visible units and depends on $W$ and $B$, while for DBNs and directed RBMs $q$ is independent from these parameters. The $2^m$ product distributions $p_{hW+B}$, $h \in \{0,1\}^m$, which we summarized in the rows of $K_{W,B}$, however are the same for all these models. The smallest model which contains all models of the form $\mathcal{M} \cdot \mathcal{K}_{m,n}$ is the $(m,n)$-zonoset mixture of products (ZMP), defined by $\mathrm{ZMP}_{n,m} := \Delta_{2^m-1} \cdot \mathcal{K}_{m,n}$, or more explicitly:

$$\mathrm{ZMP}_{n,m} = \Big\{ \sum_{h \in \{0,1\}^m} q(h)\, p_{hW+B} : W \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^n,\ q \in \Delta_{2^m-1} \Big\}. \tag{9}$$



DBNs and DBMs are "cut out" from ZMPs by their specific constraints on the mixture weights $q(h)$. The mixture weights $q$ of $\mathrm{DBN}(n_0, n_1, \ldots, n_l)$ can be chosen arbitrarily and the model is equal to $\mathrm{ZMP}_{n_0,n_1}$ only if $\mathrm{DBN}(n_1, \ldots, n_l)$ is a universal approximator on $\{0,1\}^{n_1}$.

ZMPs are submodels of very large mixtures of products; $\mathrm{ZMP}_{n,m} \subseteq \mathcal{M}_{n,2^m}$, for all $n$ and $m$. By results from [16], $\mathcal{M}_{n,2^m}$ is also the smallest mixture of products that contains $\mathrm{RBM}_{n,m}$ and thus $\mathrm{ZMP}_{n,m}$, when $4\lceil m/3 \rceil \le n$. On the other hand, each zonoset tuple shares the parameters $W$ and $B$, and the largest mixture of products contained in a ZMP is possibly relatively small. The total number of parameters of $\mathrm{ZMP}_{n,m}$ is $(m+1)n + 2^m - 1$. We note that $\mathcal{M}_{n,m+1} \subseteq \mathrm{ZMP}_{n,m}$ for all $n$ and $m$, and $\mathcal{M}_{n,k} \not\subseteq \mathrm{ZMP}_{n,m}$ when $k > \frac{(m+1)n + 2^m}{n+1}$ (by counting parameters).

Example 3. If the input $q$ is a point measure $\delta_h$, then the output is just the $h$th row $q \cdot K_{W,B} = p_{hW+B}$ of the kernel. In particular $\delta_h \cdot \mathcal{K}_{m,n} = \mathcal{M}_n$ for any $h$. If the input is the uniform distribution $u$ on $\{0,1\}^m$, then the output $p = q \cdot K_{W,B}$ is the arithmetic mean of a zonoset tuple. Figure 3 illustrates this set for one hidden and two visible units.

3 Geometry and combinatorics of zonoset kernels 

A face or a cylinder set of the $n$-cube is a maximal set of binary vectors of length $n$ with fixed values in a set of coordinates $I \subseteq [n]$. We write $[h^*_I] = \{h \in \{0,1\}^n : h_I = h^*_I\}$ for the $(n - |I|)$-dimensional face with fixed values $h_i = h^*_i$ for all $i \in I$. We write $a \oplus_2 b$ for $a + b \bmod 2$. Given a vector $h \in \{0,1\}^m$ and a subset $I \subseteq [m]$, we write $h_I$ for a vector in $\{0,1\}^I$, or for the vector with entries $(h_I)_i = h_i$ if $i \in I$ and $(h_I)_i = 0$ if $i \notin I$. The support of a probability distribution $p$ defined on a set $\mathcal{X}$ is $\operatorname{supp}(p) := \{x \in \mathcal{X} : p(x) > 0\}$.

We start by showing that certain classes of kernels can be realized as zonoset kernels. Let $n = m$. Given any $p \in \Delta_{2^n-1}$, let $K_p(h, v) := p(h \oplus_2 v)$. The rows of $K_p$ are permuted versions of the probability distribution $p$. Figure 1 illustrates the set of all kernels $K_p$ with $p$ uniformly distributed
on faces of {0, l} 3 . The mixing times of these kernels have been studied in the context of Markov 
chains on finite groups, see EH . 
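The statement that the rows of $K_p$ are permuted versions of $p$ is easy to verify on a small case. The sketch below encodes binary vectors of length $n$ as integers, so that addition modulo 2 becomes bitwise XOR; the function name `xor_kernel` and the example distribution are illustrative assumptions.

```python
def xor_kernel(p):
    """Kernel K_p(h, v) = p(h XOR v), with states encoded as integers."""
    size = len(p)  # size = 2^n
    return [[p[h ^ v] for v in range(size)] for h in range(size)]

# Hypothetical distribution on {0,1}^3, indexed by the integer encoding.
p = [0.4, 0.1, 0.05, 0.05, 0.2, 0.1, 0.05, 0.05]
K_p = xor_kernel(p)

rows_are_permutations = all(sorted(row) == sorted(p) for row in K_p)
column_sums = [sum(K_p[h][v] for h in range(len(p))) for v in range(len(p))]
```

Since XOR is symmetric, the matrix built this way is symmetric and each column is also a permutation of $p$, so in this example the kernel maps the uniform distribution to itself.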

Proposition 4. Let $p$ be any product distribution with support on any face of $\{0,1\}^n$ with fixed coordinates $I \subseteq [n]$, and let $K_p(h, v) = p(h \oplus_2 v)$ for $v, h \in \{0,1\}^n$. Then there is a zonoset kernel $K_{W,B} \in \overline{\mathcal{K}}_{n,n}$ with $K_{W,B}(h, v) = K_p(h, h_{I^c} \oplus_2 v)$ for all $h, v \in \{0,1\}^n$, and in particular:

• $\operatorname{supp}(K_{W,B}(h, \cdot)) = \operatorname{supp}(K_p(h, \cdot))$ for all $h$,

• $K_{W,B}(h, \cdot) = K_p(h, \cdot)$ for all $h$ with $\operatorname{supp}(h) \subseteq I$, e.g., for $h = (0, \ldots, 0)$,

• If $p$ is uniformly distributed on a face of $\{0,1\}^n$, then $K_{W,B} = K_p$.

The following propositions show that the set $\mathcal{K}_{m,n}$ has the dimension expected from parameter counting, and that its elements are generically full rank matrices.

Proposition 5. The set of kernels $\mathcal{K}_{m,n}$ is a multigraded toric variety.

Remark 6. Let $V$ denote a sufficient statistics of the $n$-bit independence model, e.g., a matrix with columns the elements of $\{0,1\}^n$, and let $H$ be the $(m+1) \times 2^m$-matrix with columns $\{(1, h)\}_{h \in \{0,1\}^m}$. Consider the exponential family $\mathcal{E}_{H \otimes V}$ with sufficient statistics $H \otimes V$ on $\mathcal{X} = \{0,1\}^{n+m}$. Let $\mathcal{X}_h = \{(v, h') \in \mathcal{X} : h' = h\} \cong \{0,1\}^n$ for all $h$. For each $p \in \mathcal{E}_{H \otimes V}$ there is a $K_{W,B} \in \mathcal{K}_{m,n}$ (and vice versa) with $p(\cdot \mid \mathcal{X}_h) = K_{W,B}(h, \cdot)$ $\forall h \in \{0,1\}^m$. In particular, $\dim(\mathcal{K}_{m,n}) = (m+1)n$, as expected from counting parameters.

Proposition 7. Assume that all rows of $W \in \mathbb{R}^{m \times n}$ are multiples of the same vector $C \in \mathbb{R}^n$, i.e., $W = (a_k C)_{k=1}^{m}$. For almost every $C$ and $(a_1, \ldots, a_m) \in \mathbb{R}^m$ the kernel $K_{W,B}$ is totally non-vanishing, i.e., all its minors are non-vanishing.




Proposition 8. For any $n$ and $m$ the kernels $K_{W,B} \in \mathcal{K}_{m,n}$ are full rank for almost all choices of $W$ and $B$. In particular, almost every zonoset kernel $K_{W,B} \in \mathcal{K}_{m,n}$ is injective when $m < n$, and $\dim(\Delta_{2^m-1} \cdot K_{W,B}) = \min\{2^m - 1, 2^n - 1\}$.

Example 9. Consider an RBM and a directed RBM, both with m hidden and n visible binary 
units, m < n. For almost all fixed choices of W and B, the sets of probability distributions 
{EhPhw+B(v) ^Pc(h): C £ R m } and {£ h PhW+B{v)p c {h): C £ R m } represented respec- 
tively by the two models as the bias of the hidden units vary, are almost everywhere different from 
each other (their intersection has dimension strictly less than m). When training DBNs, the DBN 
modules (directed RBMs) are commonly treated as RBMs. By this example, the probability distri- 
butions that can possibly be represented by the DBN modules almost never match the trained RBM 
distributions. 
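The difference between the two kinds of inputs can be seen in a small computation. By the description of RBM inputs above, the hidden marginal of an RBM is proportional to $Z_h \exp(Ch)$ with $Z_h = \sum_v \exp((hW+B)v)$, whereas a directed RBM takes product distributions as inputs. For two hidden units, a strictly positive $q$ is a product distribution exactly when $\log q(00) + \log q(11) = \log q(01) + \log q(10)$, and the bias $C$ cancels in this expression. The sketch below evaluates that gap; the function names and the two example weight matrices are assumptions for illustration.

```python
import math

def log_Z_h(h, W, B):
    """log of Z_h = sum_v exp((hW+B)v) = prod_j (1 + exp((hW+B)_j))."""
    m, n = len(W), len(B)
    theta = [B[j] + sum(h[i] * W[i][j] for i in range(m)) for j in range(n)]
    return sum(math.log1p(math.exp(t)) for t in theta)

def product_gap(W, B):
    """Interaction term of the RBM hidden marginal for m = 2 hidden units.

    Zero exactly when q(h) ~ Z_h exp(Ch) is a product distribution,
    independently of the hidden bias C.
    """
    return (log_Z_h((0, 0), W, B) + log_Z_h((1, 1), W, B)
            - log_Z_h((0, 1), W, B) - log_Z_h((1, 0), W, B))

B = [0.0, 0.0]
# Both hidden units connect to the same visible unit: inputs are not products.
gap_shared = product_gap([[1.0, 0.0], [1.0, 0.0]], B)
# Hidden units connect to disjoint visible units: a non-generic special case.
gap_disjoint = product_gap([[1.0, 0.0], [0.0, 1.0]], B)
```

In the first case the gap is nonzero, so no choice of hidden bias makes the RBM input a product distribution; the second case is a measure-zero exception in which the two models happen to agree.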

The binary vectors $\{0,1\}^n$ are the vertices of the $n$-dimensional unit hypercube. We call edge a pair $\{x, y\} \subseteq \{0,1\}^n$ with $d_H(x, y) = 1$, where $d_H(x, y) := |\{i \in [n] : x_i \neq y_i\}|$ denotes the Hamming distance between $x$ and $y$.

Proposition 10. Each of the following tuples of product distributions can be realized as a subset of rows of a zonoset kernel with an appropriate choice of $W$ and $B$:

1. If $C \subseteq \{0,1\}^m$, $|C| = m + 1$ are affinely independent vectors (over $\mathbb{R}^m$), e.g., $C$ is a Hamming ball of radius 1 in $\{0,1\}^m$, then $\{p_{hW+B}\}_{h \in C}$ are any $m + 1$ product distributions.

2. Let $C$ be a $K$-dimensional face of the $m$-cube, $K < m$. The set $\{p_{hW+B}\}_{h \in C}$ contains the uniform distributions on the (nonempty) intersections of any $K$ faces of the $n$-cube.

3. Let $\Lambda \subseteq [n] := \{1, \ldots, n\}$ and $\lambda \subseteq [m]$ with $|\Lambda| = |\lambda| = K$. Let $C$ be a $K$-face of the $m$-cube with free coordinates $\lambda$. $p_{hW+B}$ is the uniform distribution on $\{x : x_\Lambda = h_\lambda\}$ for all $h \in C$. Note that $\{x : x_\Lambda = h_\lambda\}_{h \in C}$ is a partition of $\{0,1\}^n$ into blocks of cardinality $2^{n-K}$.

4. Let $m = n$ and let $\{h^{i+}, h^{i-}\}$, $i = 1, \ldots, m$ be $m$ disjoint edges of the $m$-cube. $p_{h^{i+}W+B}$ is any distribution supported on the edge $\{h^{i+}, h^{i+} \oplus_2 e_i\}$, and $p_{h^{i-}W+B}$ is any distribution supported on the edge $\{h^{i-}, h^{i-} \oplus_2 e_i\}$, for all $i \in [m]$. Moreover, $p_{hW+B} = \delta_h$ for all $h \notin \cup_i \{h^{i+}, h^{i-}\}$. (This statement in fact summarizes [11, Theorems 1 and 2].)
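Item 3 can be illustrated numerically by one possible choice of parameters, which is an assumption of this sketch and not necessarily the construction used in the proofs: pair each free hidden coordinate $i$ with one visible coordinate, set the corresponding weight to $2\alpha$ and the bias to $-\alpha$, and leave all other entries at zero. The natural parameter of the paired visible bit is then $\alpha(2h_i - 1)$, which pins it to $h_i$ as $\alpha$ grows, while unpaired visible bits stay uniform.

```python
import itertools
import math

def states(k):
    return list(itertools.product((0, 1), repeat=k))

def row(h, W, B):
    """Product distribution p_{hW+B} as a dict over visible states."""
    m, n = len(W), len(B)
    theta = [B[j] + sum(h[i] * W[i][j] for i in range(m)) for j in range(n)]
    weights = {v: math.exp(sum(t * x for t, x in zip(theta, v))) for v in states(n)}
    Z_h = sum(weights.values())
    return {v: w / Z_h for v, w in weights.items()}

# Hypothetical setting: m = 2 hidden units, n = 3 visible units, K = 2.
# Hidden unit i is paired with visible unit i; visible unit 2 is unconstrained.
alpha = 30.0  # large but finite, so the kernel is only close to the limit
W = [[2 * alpha, 0.0, 0.0],
     [0.0, 2 * alpha, 0.0]]
B = [-alpha, -alpha, 0.0]

rows = {h: row(h, W, B) for h in states(2)}
```

Each row is then numerically indistinguishable from the uniform distribution on the block $\{x : x_1 = h_1, x_2 = h_2\}$ of cardinality $2^{3-2} = 2$, and the four blocks partition $\{0,1\}^3$.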

Corollary 11. The model $\mathrm{DBN}(n, m, m)$ contains the mixture model $\mathcal{M}_{n,m+1}$. In contrast, $\mathrm{RBM}_{n,m}$ does not contain $\mathcal{M}_{n,m+1}$, in general.

3.1 Patterns of modes in zonoset tuples 

In the following we elaborate on the sets of modes that can be realized jointly by rows of zonoset kernels, slightly extending results on RBMs and mixtures of products shown in [16].

A mode of a probability distribution $p \in \Delta_{2^n-1}$ is a point $x \in \{0,1\}^n$ such that $p(x) > p(y)$ for all $y$ with $d_H(x, y) = 1$. The set of strong modes [16] of $p$ is $\{x \in \{0,1\}^n : p(x) > \sum_{y : d_H(x,y)=1} p(y)\}$. We denote by $\mathcal{H}_C \subseteq \Delta_{2^n-1}$ the set of probability distributions with strong modes $C$. An $n$-bit code $C$ is just a subset of $\{0,1\}^n$. The minimum distance of $C$ is defined as $\min\{d_H(x, y) : x, y \in C, x \neq y\}$. Given a sign vector $s \in \{-, +\}^n$, the $s$-orthant of $\mathbb{R}^n$ is the set of all vectors in $\mathbb{R}^n$ with sign $s$. We identify sign vectors $\{-, +\}^n$ and binary vectors $\{0,1\}^n$ via $- \mapsto 0$ and $+ \mapsto 1$.
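The two notions of mode can be computed directly from their definitions. The sketch below is a plain-Python illustration; the function names and the example distribution are assumptions introduced here.

```python
import itertools

def neighbors(x):
    """All binary vectors at Hamming distance one from x."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

def modes(p):
    """Points whose probability exceeds that of every Hamming neighbor."""
    return {x for x in p if all(p[x] > p[y] for y in neighbors(x))}

def strong_modes(p):
    """Points whose probability exceeds the sum over all Hamming neighbors."""
    return {x for x in p if p[x] > sum(p[y] for y in neighbors(x))}

# Hypothetical two-bit distribution concentrated on the even-parity code.
p = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
uniform = {x: 0.25 for x in itertools.product((0, 1), repeat=2)}

p_modes = modes(p)
p_strong = strong_modes(p)
uniform_modes = modes(uniform)
```

In this example the strong modes form the code $\{00, 11\}$ of minimum distance two, while the uniform distribution has no modes because the defining inequality is strict.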

Proposition 12.

1. Let $C \subseteq \{0,1\}^n$ be a code of minimum distance two. If the model $\mathrm{ZMP}_{n,m}$ contains a probability distribution with strong modes $C$, then there is an $(m,n)$-zonoset with a point in every $s$-orthant of $\mathbb{R}^n$, $s \in C$.

2. If $\mathrm{ZMP}(n_0, n_1)$ contains probability distributions with $2^{n_0-1}$ strong modes, then $n_1 \ge n_0 - 1$. In fact $n_1 \ge n_0$, when $n_0$ is odd and larger than one.

3. If $\mathrm{ZMP}(n_0, n_1)$ is a universal approximator of distributions from $\Delta_{2^{n_0}-1}$ with $n_0 \ge 7$, then $n_1 \ge n_0$.

In particular, when $n_0 \ge 3$, the DBNs with layers of widths $n_0 > n_1 > \cdots > n_l$ cannot represent distributions with $2^{n_0-1}$ strong modes. If $\mathrm{DBN}(n_0, n_1, \ldots, n_l)$ is a universal approximator with $n_1 = n_0 - 1$ (and $n_0 \le 6$), then $\mathrm{DBN}(n_1, n_2, \ldots, n_l)$ is also a universal approximator.
A linear threshold code (LTC) is a subset of $\{0,1\}^n$ that corresponds to the sign vectors of the points of a zonoset in $\mathbb{R}^n$. Equivalently, an LTC is an admissible multi-labeling of the vertices of a hypercube by a collection of linear threshold functions.

Proposition 13. Let C C {0, l} n , \C\ = 2 m be a code of minimum distance two. Then both u • JC njTn 
and A2r»_i • /C n)Tn contain a distribution with strong modes C iffC is a linear threshold code. 

Proposition 14. 

• If4\m/3] < n, then u • K n ^m n 7i n ,2 m ^ and A4 n ,k 5 u • JC njm iffk > 2 m . 

• IfA\m/3] > n, then u ■ JC njTn D 7i n ,L ^ 0, where L := mm{2 1 + ra - Z, 2 n ~ 1 }, I := max{/ G 
N: 4fZ/3"| < n), and A4 n ,k 5 u • K n ,m only ifk>L. 

3.2 Submodels of DBNs from probability sharing 

The idea of this subsection is to propagate the probability mass of distributions generated by the top RBM of a DBN across the network, in order to learn something about the visible probability distributions at the bottom. This can be accomplished by describing the products of kernels $\mathcal{K}_{n_{l-1},n_{l-2}} \cdot \mathcal{K}_{n_{l-2},n_{l-3}} \cdots \mathcal{K}_{n_1,n_0}$. For simplicity we shall consider layers of the same width as the visible layer, $n$. In this case the propagation can be interpreted as a process in the graph of a hypercube.

A kernel realizes sharing of probability from a state $a \in \{0,1\}^n$ to a state $b \in \{0,1\}^n$ if its $a$th row has non-vanishing $b$th entry. It is possible to share probability from $a$ to a collection of states $b^{(1)}, \ldots, b^{(s)}$ in arbitrary ratios by a product of $l$ kernels iff the $a$th row of a product of $l$ kernels in $\mathcal{K}_{n,n}$ can be made an arbitrary distribution on $b^{(1)}, \ldots, b^{(s)}$. In particular, since all rows of zonoset kernels are product distributions, probability sharing from one state to more than two states, in arbitrary ratios, is not possible in one single DBN layer.

An $l$-path on the graph of the $n$-cube is a list $S$ of $l$ vectors in $\{0,1\}^n$ with subsequent elements differing in at most one bit, $S_1, \ldots, S_l \in \{0,1\}^n$, $d_H(S_k, S_{k+1}) \le 1$. An $n$-bit Gray code of length $l$ is a special $l$-path with different subsequent elements. The transition sequence $T$ of a path is the list of bit-indices where the subsequent elements differ from each other (possibly empty).
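As a concrete instance of these definitions, the sketch below builds the standard reflected binary Gray code and extracts the transition sequence of a path. The function names, the choice of the reflected code, and the convention of recording `None` when two subsequent elements coincide are assumptions of this illustration.

```python
def gray_code(n):
    """Reflected binary Gray code on n bits, as tuples with bit 0 first."""
    codes = [i ^ (i >> 1) for i in range(2 ** n)]
    return [tuple((c >> j) & 1 for j in range(n)) for c in codes]

def transition_sequence(path):
    """Bit index changed at each step of a path (None if no bit changes)."""
    T = []
    for a, b in zip(path, path[1:]):
        changed = [j for j in range(len(a)) if a[j] != b[j]]
        if len(changed) > 1:
            raise ValueError("subsequent elements differ in more than one bit")
        T.append(changed[0] if changed else None)
    return T

path = gray_code(3)
T = transition_sequence(path)
```

For three bits this visits all eight vertices of the cube exactly once, so it is a Gray code of length 8 in the sense above, with transition sequence `[0, 1, 0, 2, 0, 1, 0]`.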

Let <S(RBM njm ) denote the collection of support sets of all faces of the probability simplex A 2 ™_i, 
which are contained in RBM n;m . It is known that any union of (m + 1) edges of the n-cube is is in 
<S(RBM n;Tn ), see [021 Theorem 1]. Consider some R G <S(RBM n;n ) and a collection of /-paths S % 
starting from R, such that at any time 1 < t < I — 1 two paths change the same bit only if they are 
visiting neighboring points. We denote the collection of all such sets by 

$$\mathcal{S}^l_n := \Big\{ \cup_{i \in R} S^i \;\Big|\; \cup_i S^i_1 = R \in \mathcal{S}(\mathrm{RBM}_{n,n}),\ T^i_t \neq T^j_t \text{ unless } d_H(S^i_t, S^j_t) = 1 \Big\}. \tag{10}$$

The following result generalizes [11, Lemma 1, Theorems 1 and 2] to DBNs with any number of layers of constant width:

Lemma 15. The model $\mathrm{DBN}(n, \ldots, n)$ with $l$ hidden layers contains any probability distribution with support in an element of $\mathcal{S}^l_n$.

For some elements of $\mathcal{S}^l_n$ we find an explicit description:

Proposition 16. If $n \ge N(2^k + k + 1)$ and $l \ge 2^{2^k}$ for some $k \in \mathbb{N}$, then $\mathcal{S}^l_n$ contains the union of $N$ arbitrary $(2^k + k + 1)$-dimensional faces of the $n$-cube with disjoint free coordinates. In particular, when $l \ge \frac{2^n}{2(n - \log(n))}$, the entire state space $\{0,1\}^n$ is an element of $\mathcal{S}^l_n$.

4 Expressive power and approximation errors of DBNs 

In this section we describe some submodels of DBNs explicitly, and use them to bound the approx- 
imation errors of DBNs from above. 

Let $\varrho = \{A_1, \ldots, A_K\}$ be a partition of $\{0,1\}^n$. The partition model $\mathcal{M}_\varrho$ is the set of all probability distributions with $p(x) = p(y)$ whenever $x$ and $y$ belong to the same block $A_i$ of the partition $\varrho$.

The following collects some results shown in the previous section: 
Theorem 17. Let $l \in \mathbb{N}$. Let $k$ be the largest natural number for which $l - 1 \ge 2^{2^k}$, and let $K = 2^k + k + 1 \le n$. The model $\mathrm{DBN}(n, \ldots, n)$ with $l$ hidden layers contains:

• Any $p \in \Delta_{2^n-1}$ with support contained in an element of $\mathcal{S}^l_n$.

• Any partition model $\mathcal{M}_\varrho$ with partition $\varrho = \{[y_\Lambda]\}_{y_\Lambda \in \{0,1\}^K}$, $\Lambda \subseteq [n]$, $|\Lambda| = K$.

If K > n, then the DBN is a universal approximator, which is consistent with [Q21 Theorem 1]. 
The Kullback-Leibler divergence from a point $p$ to a model $\mathcal{M}$ in $\Delta_{2^n-1}$ is defined as $D(p \| \mathcal{M}) := \inf_{q \in \mathcal{M}} D(p \| q)$, where $D(p \| q) := \sum_{x \in \{0,1\}^n} p(x) \log \frac{p(x)}{q(x)}$ is the divergence from $p$ to $q$. The
maximal KL-divergence [EIQIO from a partition model M e with 2 K blocks of cardinalities 2 n ~ K . 



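as given in the second item of Theorem 17 is $\max_{p \in \Delta_{2^n-1}} D(p \| \mathcal{M}_\varrho) = (n - K)$, see [18, Corollary 3.1]. The Dirichlet prior on $\Delta_{2^n-1}$ with concentration parameter $\alpha = (\alpha_x)_{x \in \{0,1\}^n}$ is $\mathrm{Dir}_\alpha(p) := \frac{\Gamma(\sum_x \alpha_x)}{\prod_x \Gamma(\alpha_x)} \prod_x p(x)^{\alpha_x - 1}$ for all $p \in \Delta_{2^n-1}$, whereby the sums and products are over $x \in \{0,1\}^n$. If $p$ is drawn from this prior, then the expected approximation error is, see [17, Theorem 4]:

$$E[D(p \| \mathcal{M}_\varrho)] = (n - K)\ln(2) + \sum_{x \in \{0,1\}^n} \frac{\alpha_x}{\sum_y \alpha_y}\, h(\alpha_x) - \sum_{i=1}^{2^K} \frac{\sum_{x \in A_i} \alpha_x}{\sum_y \alpha_y}\, h\Big(\sum_{x \in A_i} \alpha_x\Big), \tag{11}$$

where $h(k) := 1 + \frac{1}{2} + \cdots + \frac{1}{k}$ denotes the $k$th harmonic number.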
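The maximal divergence $n - K$ from a partition model with $2^K$ blocks of cardinality $2^{n-K}$ can be checked on a small example. The sketch relies on the standard fact that the minimizer of $D(p \| q)$ over a partition model spreads the mass $p(A_i)$ uniformly over each block $A_i$; it measures the divergence in bits and takes the blocks to be the cylinder sets that fix the first $K$ coordinates. All function names and these conventions are assumptions of the illustration.

```python
import itertools
import math
import random

def project_to_partition_model(p, K):
    """Spread the mass of each block {x : x[:K] fixed} uniformly over it."""
    n = len(next(iter(p)))
    block_mass = {}
    for x, px in p.items():
        block_mass[x[:K]] = block_mass.get(x[:K], 0.0) + px
    block_size = 2 ** (n - K)
    return {x: block_mass[x[:K]] / block_size for x in p}

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def divergence_from_partition_model(p, K):
    return kl_bits(p, project_to_partition_model(p, K))

n, K = 4, 2
xs = list(itertools.product((0, 1), repeat=n))

# A point measure attains the maximal divergence n - K.
point = {x: (1.0 if x == (0, 1, 1, 0) else 0.0) for x in xs}
d_point = divergence_from_partition_model(point, K)

# A randomly drawn distribution stays below that value.
rng = random.Random(0)
raw = [rng.random() for _ in xs]
random_p = {x: r / sum(raw) for x, r in zip(xs, raw)}
d_random = divergence_from_partition_model(random_p, K)
```

With $n = 4$ and $K = 2$ the point measure gives a divergence of exactly 2 bits, and any other input gives a smaller or equal value.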

The approximation error of a DBN is bounded from above by the approximation error of any of its 
submodels. If we use any of the partition models with 2 K blocks of cardinalities 2 n ~ K , we get: 

Theorem 18. Consider a DBN with $l$ hidden layers of width $n$.

• The maximal KL-approximation error of this model is bounded from above by

$$\max_{p \in \Delta_{2^n-1}} D(p \| \mathrm{DBN}) \le n - K, \quad \text{where } K = 2^k + k + 1 = \log(2 l \log(l)).$$

• The expected KL-approximation error is bounded from above by eq. (11). In particular, if $p$ is drawn uniformly at random from the probability simplex $\Delta_{2^n-1}$, then the expected divergence $E[D(p \| \mathrm{DBN})]$ is bounded from above by $1 + \ln(2^{n-K}) - h(2^{n-K})$.
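To get a feeling for the size of these bounds, the sketch below evaluates them for given $n$ and $k$. It assumes the reading $K = 2^k + k + 1$ and $l - 1 \ge 2^{2^k}$ of the statements above, takes $k$ as an input instead of deriving it from $l$, and uses the hypothetical function names `harmonic` and `dbn_error_bounds`.

```python
import math

def harmonic(k):
    """k-th harmonic number h(k) = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))

def dbn_error_bounds(n, k):
    """Bounds for a DBN of width n, under the assumptions stated above.

    Returns (hidden layers needed, K, maximal-error bound n - K,
    expected-error bound for a uniform prior on the simplex).
    """
    K = 2 ** k + k + 1
    if K > n:
        raise ValueError("K must not exceed the layer width n")
    layers = 2 ** (2 ** k) + 1          # smallest l with l - 1 >= 2^(2^k)
    block = 2 ** (n - K)                # cardinality of each partition block
    expected = 1.0 + math.log(block) - harmonic(block)
    return layers, K, n - K, expected

layers, K, max_bound, expected_bound = dbn_error_bounds(n=8, k=1)
```

For $n = 8$ and $k = 1$ this gives $K = 4$, five hidden layers, a maximal-error bound of 4 and an expected-error bound of roughly 0.39, which shows how much smaller the average-case bound is than the worst-case one.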



5 Discussion 

Deep belief networks generate mixtures of tuples of product distributions whose parameters are 
projections of hypercubes' vertices (zonosets), described by very few shared parameters. We cast 
these tuples of product distributions as the rows of stochastic matrices (zonoset kernels), and studied 
properties such as their rank, symmetries, and combinatorics. 

This analysis exposes similarities of DBNs and DBMs, and shows possible ways of defining distributed mixtures of products; e.g., as $\mathcal{S} \cdot \mathcal{K}$, with a low-dimensional model $\mathcal{S} \subseteq \Delta_{2^m-1}$, and a family of kernels $\mathcal{K}$. The rows of each kernel in the family $\mathcal{K}$ can be chosen as product distributions with parameters equal to the projected vertices of a hypercube, or the projected vertices of any other low-dimensional polytope. In contrast, standard, unrestricted mixtures of products correspond to projected vertices of (high-dimensional) simplices.

Kernels are helpful for understanding probability sharing in layered networks. We showed explicit classes of probability distributions that can be learned by DBNs depending on the number of hidden layers that they contain. Various submodels of RBMs with $k$ parameters, such as unions of partition models, can be learned by deep and narrow DBNs with $k$ parameters. We showed that the maximal approximation error of narrow DBNs is not larger than the upper bounds on the approximation errors of RBMs with the same number of parameters shown in [18].

Furthermore, we bounded the expected approximation error of DBNs from above. Our bounds are with respect to Dirichlet priors. These priors not only have technical advantages, but are also a canonical choice when no information is available about the real distribution of the targets. It could be interesting to consider other priors in future work. We note in particular that the exact expected error formula from Theorem 18, item 2, eq. (11), can be integrated over a hyperprior of interest.



The approximation error bounds from Theorem 18 can possibly be improved by taking into account the totality of DBN submodels described in this paper, instead of just partition models. It is worth mentioning that any DBN which is a graphical supermodel of $\mathrm{DBN}(n, n-1, n-2, \ldots, 1)$ has the general Markov model corresponding to any tree on $n$ leaves as a graphical submodel. That is, this DBN contains the union of all such tree models. Furthermore, DBNs often contain Hadamard products of trees as well, so it is possible to study their dimension by tropicalization [19].
Proofs

Geometry and combinatorics of zonoset kernels 

Proof of Proposition^ The kernel K^* = K^ Uh ^ has rows equal to the indicator functions of h 2 

[h% h £ {0, l} n , multiplied by the constant 2'^-^. Note that [h}} = e t ® 2 [h}} foralH £ [n]\I. 
For each v [n] \ T £ {0, 1}MV, the sets (vj, V[ n ]\j) 2 [h% vj £ {0, 1} J partition {0, l} n into 2^ 
cylinder sets. The connection weights W(i,j) = cx(—h* + and the bias weights 

B(j) = -a\(-h] + l)ti(j) produce the kernel 

K w , B (h,v) = exp(a^(-/i} nsupp/l + ^1/nsupp/i + h}\ BXipph - ^li\ 8VL p P h)v) / Z . 

The limit lim^oo K WjB is equal to K^ h *y To complete the proof we add the natural parameter 
vector C/c of p to the previously defined bias vector B. Then K^w+ b+c ic satisfies the claims. □ 

Proof of Proposition 5. Replacing the parameters $W_{ij}$, $B_j$ with their exponentials $\omega_{ij}$ and $\beta_j$, we obtain a multigraded monomial map $\varphi : \mathbb{C}^{nm+n} \to \prod_{h \in \{0,1\}^m} \mathbb{P}^{2^n-1}$; $q_{h,v} = \prod_{j=1}^{n} \beta_j^{v_j} \prod_{i=1}^{m} \omega_{ij}^{h_i v_j}$. The Zariski closure of the image of this map is a multigraded toric variety inside a product of $2^m$, $(2^n - 1)$-dimensional projective spaces, one for each hidden state. This variety is cut out by a multigraded monomial ideal generated by the multigraded binomials appearing in the kernel. □

Proof of Proposition 7. The rows of $K_{W,B}$ are the product distributions with natural parameters the zonoset generated by $W$ and $B$. For assessing the rank of $K_{W,B}$ we may neglect the normalizing constants, and consider the matrix $\tilde{K}_{W,B}$ with rows $(\exp((hW + B)v))_{v \in \{0,1\}^n}$, $h \in \{0,1\}^m$. Furthermore, for any $B$ with finite entries, the rank of $\tilde{K}_{W,B}$ and $\tilde{K}_{W,B} \cdot \operatorname{diag}(\exp(-Bv))_v = \tilde{K}_{W,0}$ is equal.

Given the assumptions, the zonoset $Z = \{hW + B : h \in \{0,1\}^m\}$ is contained in a straight line, $Z = \{\lambda_j C + B\}_{j=1}^{2^m}$, whereby the numbers $\lambda_j \in \mathbb{R}$ are all different from each other, for almost all $(a_k)_k \in \mathbb{R}^m$. Let $(t_1, \ldots, t_{2^n}) := (\exp(Cv))_{v \in \{0,1\}^n}$. Note that $t_i > 0$ for all $i$, and all $t_i$ are different from each other, for almost all $C \in \mathbb{R}^n$. The rank of $\tilde{K}_{W,B}$ is equal to the rank of $(t_i^{\lambda_j})_{i,j}$ which, after some permutation of rows and columns, is a generalized Vandermonde matrix, known to be totally positive. Hence $\det(K_{W,B}(h, v))_{h \in H, v \in V} \neq 0$ for all $H \subseteq \{0,1\}^m$ and $V \subseteq \{0,1\}^n$ with $|H| = |V|$, as claimed. □

Proof of Proposition 8. 1) First note that there is an open subset $\Omega \subseteq \mathbb{R}^{m \times n} \times \mathbb{R}^n$ of parameters $W, B$ for which the kernels $K_{W,B}$ are full rank: Assume that the zonoset $\{hW + B : h \in \{0,1\}^m\}$ intersects $2^m$ orthants of $\mathbb{R}^n$, e.g., $W = I_n$ and $B = -\frac{1}{2}(1, \ldots, 1)$. Then $K_{\alpha W, \alpha B}$ is full rank for all $\alpha$ larger than some $\alpha_0 \in \mathbb{R}$, because for $\alpha \to \infty$ each row of $K_{\alpha W, \alpha B}$ converges to a different point measure. 2) Now, by Proposition 5, $\mathcal{K}_{m,n}$ is a (toric, irreducible) variety for all $m, n \in \mathbb{N}$. Let

I = minjra, n}. The set H of rank-deficient matrices inC 2 x2 ,orin 

j-j-m ^ p 2 ^ i • s a hypersurface 

cut out by the vanishing of the determinant (which is a homogeneous polynomial on the matrix 
entries). Since JCij % ^, by Q Proposition 7.1], every irreducible component of JCij H H has 
dimension dim(/Q ; z) — 1. This is also an upper bound for the dimension of the real part of the set 
of rank-deficient kernels in JCij. □ 
Proof of Example 9. Both models have the same zonoset kernels. For any choice of $W$ and $B$, the set of inputs of the directed RBM are the product distributions $\mathcal{M}_m$. The set of inputs of the RBM are
the distributions q(h) = • exp(CTi), which are product distributions iff z(h) = ^ h Zh e ^2 m -i 

is a product distribution. In it is shown that { -| J2 V exp(hWv +Bv) : W, B} is a set of dimension 
ran + n when n...m. Since is injective, each output has a unique preimage. □ 



Proof of Corollary 11. By item 1 of Proposition 10, the set of distributions $q \in \Delta_{2^m - 1}$ with support on a radius-one Hamming ball is mapped by $\mathcal{K}_{m,n}$ into the $(m + 1)$-mixture of product distributions $\mathcal{M}_{n,m+1}$. The claim follows using that $\mathrm{RBM}_{n,m}$ contains any $p$ with $|\operatorname{supp}(p)| \leq m + 1$, see [15]. That RBMs do not contain the mixture model is a result from [16]. □
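The mapping used in this proof can be made concrete on a small case. The sketch below assumes kernels of the form $K_{W,B}(h, v) \propto \exp((hW + B)v)$ and pushes a distribution $q$ supported on the radius-one Hamming ball around the zero vector through such a kernel; the result is a mixture of $m + 1$ product distributions by construction. All function names and parameter values are illustrative assumptions.

```python
import itertools
import math

def hamming_ball(m):
    """Radius-one Hamming ball around the zero vector in {0,1}^m."""
    center = (0,) * m
    neighbors = [tuple(1 if k == j else 0 for k in range(m)) for j in range(m)]
    return [center] + neighbors

def product_distribution(theta):
    """Product distribution on {0,1}^n with natural parameters theta."""
    states = list(itertools.product((0, 1), repeat=len(theta)))
    weights = [math.exp(sum(t * x for t, x in zip(theta, v))) for v in states]
    total = sum(weights)
    return {v: w / total for v, w in zip(states, weights)}

def push_forward(q, W, B):
    """Visible distribution sum_h q(h) K_{W,B}(h, .), with q given as a dict."""
    n = len(B)
    visible = {v: 0.0 for v in itertools.product((0, 1), repeat=n)}
    for h, weight in q.items():
        theta = [sum(h[k] * W[k][i] for k in range(len(h))) + B[i]
                 for i in range(n)]
        for v, prob in product_distribution(theta).items():
            visible[v] += weight * prob  # one mixture component per h
    return visible

m, n = 3, 2
W = [[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]]  # assumed example weights
B = [0.1, -0.2]                             # assumed example biases
ball = hamming_ball(m)
q = {h: 1.0 / len(ball) for h in ball}      # uniform weights on the ball
p = push_forward(q, W, B)
```

Here `ball` has $m + 1 = 4$ elements, so `p` is a mixture of four product distributions on $\{0,1\}^2$ with mixture weights given by `q`.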

Patterns of modes in zonoset tuples 

Proof of Proposition 12.

1. The first item follows from [16, Theorems 3 and 11].

2. For the first part: The number of strong modes of a mixture of $k$ binary product distributions is at most $k$ [16, Theorem 3]. For the second part: If $n$ is odd and larger than one, then the smallest mixture of binary product distributions whose natural parameters are a zonoset and which approximates $u_{Z_{\pm,n}}$ arbitrarily well has a zonoset generated by at least $n$ vectors. See [021 Proposition 14].

3. The first part of the third item follows from parameter counting: The model $\sum_h \frac{1}{Z(h)} \exp((hW + B)v)\, p(h)$, $p \in \Delta_{2^{n-1} - 1}$, has a total of $n^2 + n + 2^{n-1} - 1$ parameters. This number is smaller than $\dim(\Delta_{2^n - 1}) = 2^n - 1$ when $n \geq 7$. For the second part: Any mixture of binary product distributions which approximates some $p$ with support $Z_{\pm,n}$ arbitrarily well mixes the $2^{n-1}$ Dirac distributions $\delta_v$, $v \in Z_{\pm,n}$, see [14]. Hence, if $\mathrm{DBN}(n_0, \ldots, n_l)$ approximates any distribution $p$ with support $Z_{\pm,n}$ arbitrarily well, then the mixture weights (distributions from $\mathrm{DBN}(n_1, \ldots, n_l)$) approximate $p|_{Z_{\pm,n}}$ arbitrarily well. □
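The parameter count in the third item is easy to tabulate. The following sketch compares $n^2 + n + 2^{n-1} - 1$ with $2^n - 1$ and finds the smallest $n$ at which the model has fewer parameters than the dimension of the simplex; the function names are illustrative.

```python
def kernel_mixture_parameters(n):
    """n*n weights, n biases, and 2^(n-1) - 1 free mixture weights."""
    return n * n + n + 2 ** (n - 1) - 1

def simplex_dimension(n):
    """Dimension of the simplex of distributions on {0,1}^n."""
    return 2 ** n - 1

# Tabulate both counts for small n.
table = [(n, kernel_mixture_parameters(n), simplex_dimension(n))
         for n in range(1, 11)]

# Smallest n with fewer parameters than the dimension of the simplex.
smallest_n = next(n for n, params, dim in table if params < dim)
```

For $n = 6$ the counts are 73 and 63, and for $n = 7$ they are 119 and 127, so the inequality first holds at $n = 7$ and continues to hold for larger $n$ because $2^{n-1}$ outgrows $n^2 + n$.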

Proof of Proposition 13. This is a direct consequence of the analysis from [16]. □

Proof of Proposition 14. The proof of the first item follows the lines of the proof of [16, Theorem 32]. For the second item, note that if $\mathrm{DBN}(n, m, \ldots)$ can represent some $p$, then $\mathrm{DBN}(n, m + 1, \ldots)$ can represent $\lambda p + (1 - \lambda)\delta_x$ for any $x \in \{0,1\}^n$ for some $0 < \lambda < 1$. □

Proof of Theorem 15. This result is a straightforward generalization of Qj] Lemma 1, Theorems 1 and 2]. The elements of E> l n meet the conditions of this lemma and these theorems by definition. □

Submodels of DBNs from probability sharing 

Proof of Proposition 16.

1. This follows immediately from [Q21 Lemma 4]. Any sub-DBN with layers of width (n — R) is 
contained in the DBN with layers of width n. The distribution on the states of the remaining R 
visible nodes can be set to a point measure. 

2. This follows from an argument similar to that of the first item. Any set of cardinality $(n + 1)$ is an $\mathcal{S}$-set of $\mathrm{RBM}_{n,n}$. □

Proof of Proposition 10.

1. If $C = \{h^{(0)}, \ldots, h^{(m)}\}$ are affinely independent, then $\{h^{(1)} - h^{(0)}, \ldots, h^{(m)} - h^{(0)}\}$ are linearly independent and can be mapped by $W$ to an arbitrary set $\{W'_1, \ldots, W'_m\} \subset \mathbb{R}^n$. Choosing $B = B' - h^{(0)}W$, we can make $\{hW + B : h \in C\}$ be the vectors $B', B' + W'_1, \ldots, B' + W'_m$, which are arbitrary, and so $\{p_{hW+B} : h \in C\}$ is an arbitrary set of $m + 1$ product distributions.
2. Any $h$ can be identified with its support set. The $p_{\{i\}}$, $i \in [m]$, are $m$ uniform distributions on arbitrary faces $F_i$ of the $n$-cube. $p_\lambda$ is uniformly distributed with support $\operatorname{argmax}(\sum_{i \in \lambda} \mathbb{1}_{F_i})$. E.g., if $F_\lambda := \cap_{i \in \lambda} F_i \neq \emptyset$, then $\operatorname{supp}(p_\lambda) = F_\lambda$.

3. This follows from the choice $W_{:,\lambda} = \alpha I_\lambda$, the identity matrix, and $W_{:,[n] \setminus \lambda} = 0$.

4. Consider any $i \in [n]$. Consider a pair of vectors $\{x, y\}$ which is an edge of $\{0,1\}^m$. Let $r \in [m]$ be the entry where they differ. Let $s \in [m]$ be arbitrary. Denote by $\tilde{x}$ the vector with $\tilde{x}_k = x_k$ for all $k \neq r, s$ and $\tilde{x}_r = 0$, $\tilde{x}_s = 0$. Denote by $e_k$ the vector with one 1 at the position $k$ and zeros elsewhere, and by $\mathbb{1}$ the vector of ones. Choosing

$$W_{:,i} = \omega\,(2\tilde{x} - \mathbb{1} + (1 - 2x_s)\, m\, e_s + (p - q)\, e_r)$$
$$b_i = -\omega\,(|\operatorname{supp}(\tilde{x})| - 1 + x_s m) + q$$

yields in the limit $\omega \to \infty$ that $P(v_i = h_s \mid h \neq x, y) = 1$, $P(v_i = 1 \mid h = x) = p$, and $P(v_i = 1 \mid h = y) = q$, i.e.,

$$P(v_i \mid h \neq x, y) = \delta_{h_s}(v_i)$$
$$P(v_i \mid h = x) = p(v_i)$$
$$P(v_i \mid h = y) = q(v_i) .$$

Consider the case $m = n$. Let $\{x^l, y^l\}_{l=1}^{m}$ be $m$ disjoint edges of $\{0,1\}^m$. Let $s_i = i$ for all $i \in [m]$. Consider any $l \in [n]$. From the above discussion we get

$$P(v \mid h = x^l) = \prod_{i=1}^{n} P(v_i \mid x^l) = \prod_{i \neq l} \delta_{x^l_i}(v_i) \cdot p^l(v_l) , \qquad (12)$$

which is an arbitrary distribution with support on the edge given by fixing $v_i = x^l_i$ for all $i \neq l$. For $h \notin \cup_{l=1}^{m} \{x^l, y^l\}$ and $s_i = i$ we get

$$P(v \mid h \neq x^l, y^l \ \forall l) = \prod_{i=1}^{n} P(v_i \mid h) = \prod_{i} \delta_{h_{s_i}}(v_i) = \delta_h(v) , \qquad (13)$$

which is the point measure on $\{v = h\}$. □
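The construction in the first item of this proof amounts to solving a small linear system. The sketch below assumes a set $C$ of $m + 1$ affinely independent binary vectors and arbitrary target natural-parameter vectors, and computes $W$ and $B$ with $hW + B$ equal to the prescribed target for every $h \in C$. The helper names `solve` and `affine_map` and the example values are hypothetical.

```python
def solve(D, T):
    """Solve D X = T by Gauss-Jordan elimination (D square and invertible)."""
    m = len(D)
    aug = [list(map(float, D[r])) + list(map(float, T[r])) for r in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [entry / scale for entry in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def affine_map(C, targets):
    """Return (W, B) such that h W + B = targets[k] for h = C[k]."""
    h0, t0 = C[0], targets[0]
    # Differences h^(k) - h^(0) are linearly independent by assumption.
    D = [[a - b for a, b in zip(h, h0)] for h in C[1:]]
    T = [[a - b for a, b in zip(t, t0)] for t in targets[1:]]
    W = solve(D, T)
    h0W = [sum(h0[k] * W[k][i] for k in range(len(h0))) for i in range(len(t0))]
    B = [a - b for a, b in zip(t0, h0W)]  # B = B' - h^(0) W
    return W, B

# Assumed example: m = 3 hidden units, n = 2 visible units.
C = [(1, 1, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
targets = [(0.5, -1.0), (2.0, 0.0), (-3.0, 1.5), (0.0, 4.0)]
W, B = affine_map(C, targets)

# Recompute h W + B for every h in C and compare with the targets.
achieved = [[sum(h[k] * W[k][i] for k in range(3)) + B[i] for i in range(2)]
            for h in C]
max_error = max(abs(a - t) for row, tgt in zip(achieved, targets)
                for a, t in zip(row, tgt))
```

Since the natural parameters $hW + B$ can be prescribed freely on $C$, the corresponding $m + 1$ product distributions are arbitrary, as stated in the first item.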

Example 19. Figure 4 gives an example of zonoset kernels $K_{W,B} = K_p$ in $\mathcal{K}_{4,4}$ for $p$ the uniform distributions on faces of $\{0,1\}^4$.
Figure 4: The kernels $K_p$ for $p$ the uniform distributions on faces of $\{0,1\}^4$ of dimension zero (first line), one (the next four lines; one line for each possible edge orientation), two (the next six lines; one for each pair in $\{1, 2, 3, 4\}$), three (the next four lines), and four ($p$ is the uniform distribution on $\{0,1\}^4$). The first row of each kernel is always equal to the probability distribution $p$. The rows and columns of each kernel are in the lexicographical order of $\{0,1\}^4$. By Proposition 4, all these kernels are contained in the family $\mathcal{K}_{4,4}$.